N-gram Counts and Language Models from the Common Crawl
Authors
Abstract
We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection of over 9 billion web pages. This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate. By preserving singletons, we were able to use Kneser-Ney smoothing to build large language models. This paper describes how the corpus was processed with emphasis on the problems that arise in working with data at this scale. Our unpruned Kneser-Ney English 5-gram language model, built on 975 billion deduplicated tokens, contains over 500 billion unique n-grams. We show gains of 0.5–1.4 BLEU by using large language models to translate into various languages.
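As a rough illustration of the two ideas named in the abstract, deduplication and unpruned counts that keep singletons, the following toy sketch may help. It assumes exact-match deduplication of whole lines, which the abstract does not specify; the function names `dedupe_lines` and `count_ngrams` and the sample corpus are hypothetical, and nothing here reflects the authors' actual pipeline or its scale.

```python
from collections import Counter


def dedupe_lines(lines):
    """Yield each distinct non-empty line once, in first-seen order.

    Assumption: duplicates are detected by exact match on the stripped line.
    """
    seen = set()
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            yield key


def count_ngrams(lines, max_order=5):
    """Count every n-gram of order 1..max_order.

    Nothing is pruned, so n-grams seen only once (singletons) are kept.
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus: the repeated first line stands in for web boilerplate.
corpus = [
    "click here to read more",
    "the cat sat on the mat",
    "click here to read more",
    "the cat sat on the sofa",
]

counts = count_ngrams(dedupe_lines(corpus))
```

In this sketch the repeated boilerplate line contributes to the counts only once, while rare n-grams such as `("on", "the", "sofa")` remain in the table with a count of 1 instead of being dropped by a frequency threshold.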
Related resources
LanguageCrawl: A Generic Tool for Building Language Models Upon Common-Crawl
The web contains an immense amount of data: hundreds of billions of words are waiting to be extracted and used for language research. In this work we introduce our tool LanguageCrawl, which allows Natural Language Processing (NLP) researchers to easily construct a web-scale corpus from the Common Crawl Archive, a petabyte-scale open repository of web crawl information. Three use-cases are presented:...
Improved Smoothing for N-gram Language Models Based on Ordinary Counts
Kneser-Ney (1995) smoothing and its variants are generally recognized as having the best perplexity of any known method for estimating N-gram language models. Kneser-Ney smoothing, however, requires nonstandard N-gram counts for the lower-order models used to smooth the highest-order model. For some applications, this makes Kneser-Ney smoothing inappropriate or inconvenient. In this paper, we int...
Web-based and combined language models: a case study on noun compound identification
This paper looks at the web as a corpus and at the effects of using web counts to model language, particularly when we consider them as a domain-specific versus a general-purpose resource. We first compare three vocabularies that were ranked according to frequencies drawn from general-purpose, specialised and web corpora. Then, we look at methods to combine heterogeneous corpora and evaluate th...
N-gram Language Modeling of Japanese Using Prosodic Boundaries
A new method was developed to include prosodic boundary information into statistical language modeling. This method is based on counting word transitions separately for the cases crossing accent phrase boundaries and not crossing them. Since direct calculation of the above two types of word transitions requires a large speech corpus which is practically impossible to make, bi-gram counts of par...
Faster and Smaller N-Gram Language Models
N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google...